10 August 2019
Fraud = scientific misconduct.
Diederik Stapel, social psychologist. Suspended in 2011. Fabricating and manipulating data.
Jens Förster, social psychologist. Resigned in 2017. Data tampering.
Term coined by John, Loewenstein, & Prelec (2012).
See also Simmons, Nelson, & Simonsohn (2011).
Some examples (John et al., 2012; Schimmack, 2015):
Prof. Brian Wansink at Cornell University.
His description of the efforts of a visiting Ph.D student:
I gave her a data set of a self-funded, failed study which had null results (…). I said, “This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.” I had three ideas for potential Plan B, C, & D directions (since Plan A had failed). I told her what the analyses should be and what the tables should look like. I then asked her if she wanted to do them.
Every day she came back with puzzling new results, and every day we would scratch our heads, ask “Why,” and come up with another way to reanalyze the data with yet another set of plausible hypotheses. Eventually we started discovering solutions that held up regardless of how we pressure-tested them. I outlined the first paper, and she wrote it up (…). This happened with a second paper, and then a third paper (which was one that was based on her own discovery while digging through the data).
This isn’t creative, thinking outside the box, or worthy in any way.
This is QRPing.
Yes.
Interestingly, scientific misconduct has been a long-standing concern (see Babbage, 1830).
There are also some voices against this state of affairs (e.g., Fiedler & Schwarz, 2016).
It is strongly related to incentives (Nosek, Spies, & Motyl, 2012; F. Schönbrodt, 2015a).
(Munafò et al., 2017)
Until very recently (Makel, Plucker, & Hegarty, 2012).
How poorly we build theory (see Gelman):
“It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (…) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through (…) a long series of related experiments (…) without ever once refuting or corroborating so much as a single strand of the network.”
Low-powered experiments:
“(…) It was found that the average power (probability of rejecting false null hypotheses) over the 70 research studies was .18 for small effects, .48 for medium effects, and .83 for large effects. These values are deemed to be far too small.”
“(…) it is recommended that investigators use larger sample sizes than they customarily do.”
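To make "power" concrete, here is a minimal sketch based on a normal approximation to a two-sided, two-sample test. The effect sizes (d = 0.2, 0.5, 0.8) and the per-group sample size of 30 are illustrative assumptions, not values taken from Cohen (1962), and `approx_power` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for a standardized effect d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true standardized effect is d.
    ncp = d * sqrt(n_per_group / 2)
    # Probability that the statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)


# Illustrative (assumed) effect sizes and sample size.
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(f"{label:6s} d = {d}: power ~ {approx_power(d, 30):.2f}")
```

With a fixed sample size, power rises with the size of the effect. Raising `n_per_group` is the lever the quoted recommendation points to.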
Replicability of 13 classic and contemporary effects across 36 independent samples totaling 6,344 participants.
See also Many Labs 2 (Klein et al., 2018), Many Labs 3 (Ebersole et al., 2016).
A gazillion authors.
“The Basic and Applied Social Psychology (BASP) (…) emphasized that the null hypothesis significance testing procedure (NHSTP) is invalid (…). From now on, BASP is banning the NHSTP.”
Editorial (Harrington et al., 2019).
“(…) a requirement to replace \(p\) values with estimates of effects or association and 95% confidence intervals”
Research Master course on open science practices. Materials available (add link)!
Probability of an effect at least as extreme as the one we observed, given that \(\mathcal{H}_0\) is true.
\[\fbox{$ p\text{-value} = P\left(X_\text{obs} \text{ or more extreme}|\mathcal{H}_0\right) $}\]
The definition is simple enough, right?…
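As a small illustration of the definition, the sketch below computes an exact two-sided \(p\)-value for a hypothetical coin-tossing experiment (16 heads in 20 tosses, with \(\mathcal{H}_0\): the coin is fair). The example data and the function name `binom_p_value` are assumptions made for illustration only.

```python
from math import comb


def binom_p_value(k_obs, n, p0=0.5):
    """Exact two-sided p-value: P(outcome at least as extreme as k_obs | H0: p = p0)."""
    pmf = [comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(n + 1)]
    # "At least as extreme" is taken to mean: no more probable under H0
    # than the observed outcome (small tolerance for floating-point ties).
    return sum(p for p in pmf if p <= pmf[k_obs] * (1 + 1e-9))


# Hypothetical data: 16 heads in 20 tosses of a supposedly fair coin.
print(round(binom_p_value(16, 20), 4))  # about 0.012
```

Note that the computation conditions on \(\mathcal{H}_0\) throughout: it says nothing directly about the probability that \(\mathcal{H}_0\) itself is true.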
Consider the following statement (Falk & Greenbaum, 1995; Gigerenzer, Krauss, & Vitouch, 2004; Haller & Kraus, 2002; Oakes, 1986):
Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say, 20 subjects in each sample). Furthermore, suppose you use a simple independent means \(t\)-test and your result is significant (\(t = 2.7\), \(df = 18\), \(p = .01\)). Please mark each of the statements below as “true” or “false.” False means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.
All statements are incorrect.
But how did students and teachers perceive these statements? (see other plots)
This was in 2004. But things have not improved since…
This paper expands Goodman (2008) and elaborates on 25 (yes, twenty-five!!!) misinterpretations.
Publication bias and QRPs (\(p\)-hacking) inflate the share of small, “significant” \(p\)-values in the literature. Can we “deflate” it?
“\(p\)-curve is the distribution of statistically significant \(p\) values for a set of studies (\(ps < .05\)).”
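The quoted definition can be illustrated with a small simulation: keep only the significant \(p\)-values from many simulated studies and look at their distribution. This is a simplified sketch, not the procedure of Simonsohn et al.; it assumes simple z-tests, an arbitrary noncentrality of 2 for the "true effect" scenario, and hypothetical function names.

```python
import random
from statistics import NormalDist

Z = NormalDist()


def significant_p_values(noncentrality, n_studies, rng):
    """Simulate z-tests and keep only the two-sided p-values below .05."""
    kept = []
    for _ in range(n_studies):
        z = rng.gauss(noncentrality, 1.0)  # test statistic ~ N(noncentrality, 1)
        p = 2 * (1 - Z.cdf(abs(z)))        # two-sided p-value
        if p < 0.05:
            kept.append(p)
    return kept


def p_curve(p_values, bins=5):
    """Share of significant p-values in equal-width bins covering (0, .05)."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p / 0.05 * bins), bins - 1)] += 1
    return [c / len(p_values) for c in counts]


rng = random.Random(2019)
null_curve = p_curve(significant_p_values(0.0, 100_000, rng))    # H0 true
effect_curve = p_curve(significant_p_values(2.0, 100_000, rng))  # assumed true effect
print("H0 true    :", [round(x, 2) for x in null_curve])
print("true effect:", [round(x, 2) for x in effect_curve])
```

Under \(\mathcal{H}_0\) the significant \(p\)-values are spread roughly evenly over the five bins, whereas a true effect piles them up near zero (a right-skewed curve).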
See F. Schönbrodt (2015b) for a nice presentation. (add link)
“(…) it is hard to imagine a situation in which a dichotomous accept–reject decision is better than reporting an actual \(p\)-value or, better still, a confidence interval.”
A (say) 95% CI is a numerical interval found through a procedure that, if repeated across a series of hypothetical data sets, leads to an interval covering the true parameter 95% of the time.
Confused?
So is the vast majority of the social sciences population…
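The repeated-sampling definition of a 95% CI can be checked by simulation. The sketch below is a minimal example under assumed values (normal data, known standard deviation, true mean 10, samples of size 25); the function name `coverage` and all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(true_mean=10.0, sigma=2.0, n=25, n_repeats=10_000, level=0.95, seed=1):
    """Fraction of repeated-sample z-intervals (known sigma) that contain true_mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_crit * sigma / sqrt(n)
    hits = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        m = fmean(sample)
        # Does this particular interval cover the true mean?
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / n_repeats


print(f"coverage ~ {coverage():.3f}")
```

The 95% describes the long-run behaviour of the procedure. Any single computed interval either contains the true mean or it does not.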
From Rink’s paper, mimicking the \(p\) value study by Gigerenzer et al. (2004).
Rink’s Appendix 2. The setting
Rink’s Appendix 2. The 6 statements.
All statements are incorrect.
But how did students and teachers perceive these statements?
Put screenshot of Rink’s Table 1.
See also the Ryan paper for extra confusion.
“If we were to repeat the experiment over and over, then 95% of the time the confidence intervals contain the true mean.” (Rink’s paper)
How informative is this?!
Mental note:
Remember this when interpreting Bayesian credible intervals in part 1 of today’s workshop!
For completeness, not everyone agrees with the Hoekstra study (Miller and Garcia-Perez papers, and Morey reply).
Six principles:
This is an editorial of a special issue consisting of 43 (!!) papers.
Main ideas:
“(…) it is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “\(p < 0.05\),” and “nonsignificant” survive, whether expressed in words, by asterisks in a table, or in some other way.”
But:
“Despite the limitations of \(p\)-values (…), however, we are not recommending that the calculation and use of continuous \(p\)-values be discontinued. Where \(p\)-values are used, they should be reported as continuous quantities (e.g., \(p = 0.08\)). They should also be described in language stating what the value means in the scientific context.”
“What you will NOT find in this issue is one solution that majestically replaces the outsized role that statistical significance has come to play.”
Accept uncertainty (I cannot stress this enough!).
Be thoughtful, open, and modest.
Editorial, educational, and other institutional practices will have to change.
This includes: Journals, funding agencies, education, career system.
Value replicability, open materials and data, and reliable practices (which all take time) over “publish or perish”.
Methods:
Visit the Center for Open Science.
Prior to data collection (Chambers, 2013):
As of July 2019, 205 journals use Registered Reports. (add link)
To learn:
R package that can help detect statistical reporting errors (Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016).
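The basic idea behind such a check can be sketched in a few lines: recompute the \(p\)-value from the reported test statistic and compare it with the reported \(p\)-value. This is only an illustration of the principle, not the R package's implementation; it handles z statistics only, ignores rounding of the test statistic itself, and `z_report_is_consistent` is a hypothetical name.

```python
from statistics import NormalDist


def z_report_is_consistent(z, reported_p, decimals=2):
    """True if the reported two-sided p-value matches the one recomputed from z,
    after rounding both to the same number of decimals."""
    recomputed = 2 * (1 - NormalDist().cdf(abs(z)))
    return round(recomputed, decimals) == round(reported_p, decimals)


# Hypothetical reported results.
print(z_report_is_consistent(2.20, 0.03))  # recomputed p is about .028
print(z_report_is_consistent(2.20, 0.01))  # reported p is too small for this z
```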
(Interestingly: A recent comeback in Psychological Science.)
Lecoutre et al. (2001) paper in Rink’s CI literature
Agnoli, F., Wicherts, J. M., Veldkamp, C. L. S., Albiero, P., & Cubelli, R. (2017). Questionable research practices among Italian research psychologists. PLOS ONE, 12(3), e0172792. doi: 10.1371/journal.pone.0172792
Babbage, C. (1830). Reflections on the Decline of Science in England: And on Some of Its Causes. http://www.gutenberg.org/files/1216/1216-h/1216-h.htm.
Chambers, C. (2013). Registered reports: A new publishing initiative at Cortex. Cortex; a Journal Devoted to the Study of the Nervous System and Behavior, 49(3), 609–610. doi: 10.1016/j.cortex.2012.12.016
Chambers, C. (2017a). Talks.
Chambers, C. (2017b). The seven deadly sins of psychology: A manifesto for reforming the culture of scientific practice. doi: 10.1515/9781400884940
Chambers, C., Feredoes, E., Muthukumaraswamy, S. D., & Etchells, P. (2014). Instead of "playing the game" it is time to change the rules: Registered Reports at AIMS Neuroscience and beyond. AIMS Neuroscience, 1, 4–17.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65(3), 145–153. doi: 10.1037/h0045186
Cuddy, A. J. C., Schultz, S. J., & Fosse, N. E. (2018). P-Curving a More Comprehensive Body of Research on Postural Feedback Reveals Clear Evidential Value for Power-Posing Effects: Reply to Simmons and Simonsohn (2017). Psychological Science. doi: 10.1177/0956797617746749
Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., … Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82. doi: 10.1016/j.jesp.2015.10.012
Eich, E. (2014). Business Not as Usual. Psychological Science, 25(1), 3–6. doi: 10.1177/0956797613512465
Falk, R., & Greenbaum, C. (1995). Significance Tests Die Hard - the Amazing Persistence of a Probabilistic Misconception. Theory & Psychology, 5(1), 75–98. doi: 10.1177/0959354395051004
Fanelli, D. (2009). How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data. PLOS ONE, 4(5), e5738. doi: 10.1371/journal.pone.0005738
Fiedler, K., & Schwarz, N. (2016). Questionable Research Practices Revisited. Social Psychological and Personality Science, 7(1), 45–52. doi: 10.1177/1948550615612150
Flore, P. C., Mulder, J., & Wicherts, J. M. (2019). The influence of gender stereotype threat on mathematics test scores of Dutch high school students: A registered report. Comprehensive Results in Social Psychology, 1–35. doi: 10.1080/23743603.2018.1559647
Frank, M. C., & Saxe, R. (2012). Teaching Replication. Perspectives on Psychological Science, 7(6), 600–604. doi: 10.1177/1745691612460686
Fraser, H., Parker, T., Nakagawa, S., Barnett, A., & Fidler, F. (2018). Questionable research practices in ecology and evolution. PLOS ONE, 13(7), e0200303. doi: 10.1371/journal.pone.0200303
Fried, E. I. (2017). The 52 symptoms of major depression: Lack of content overlap among seven common depression scales. Journal of Affective Disorders, 208, 191–197. doi: 10.1016/j.jad.2016.10.019
Friese, M., Loschelder, D. D., Gieseler, K., Frankenbach, J., & Inzlicht, M. (2019). Is Ego Depletion Real? An Analysis of Arguments. Personality and Social Psychology Review, 23(2), 107–131. doi: 10.1177/1088868318762183
Gendron, M., Crivelli, C., & Barrett, L. F. (2018). Universality Reconsidered: Diversity in Making Meaning of Facial Expressions. Current Directions in Psychological Science, 27(4), 211–219. doi: 10.1177/0963721417746794
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. Sage.
Goodman, S. (2008). A dirty dozen: Twelve p-value misconceptions. Seminars in Hematology, 45(3), 135–140. doi: 10.1053/j.seminhematol.2008.04.003
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. doi: 10.1007/s10654-016-0149-3
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., … Zwienenberg, M. (2016). A Multilab Preregistered Replication of the Ego-Depletion Effect. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 11(4), 546–573. doi: 10.1177/1745691616652873
Haller, H., & Kraus, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research, 7(1), 1–20.
Harrington, D., D’Agostino, R. B., Gatsonis, C., Hogan, J. W., Hunter, D. J., Normand, S.-L. T., … Hamel, M. B. (2019). New Guidelines for Statistical Reporting in the Journal. New England Journal of Medicine, 381(3), 285–286. doi: 10.1056/NEJMe1906559
Heathers, J. (2018). Alright, let’s have a roll-call of the big psychology studied that ate their own teeth for one reason or another. SOCIAL PRIMING. Lots of failed repos. http://www.slate.com/articles/health_and_science/science/2014/07/replication_controversy_in_psychology_bullying_file_drawer_effect_blog_posts.html … [Tweet]. https://twitter.com/jamesheathers/status/1006287906087071748.
Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLOS Medicine, 2(8), e124. doi: 10.1371/journal.pmed.0020124
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science, 23(5), 524–532. doi: 10.1177/0956797611430953
Kaplan, R. M., & Irvin, V. L. (2015). Likelihood of Null Effects of Large NHLBI Clinical Trials Has Increased over Time. PloS One, 10(8), e0132382. doi: 10.1371/journal.pone.0132382
Kerr, N. L. (1998). HARKing: Hypothesizing After the Results are Known. Personality and Social Psychology Review, 2(3), 196–217. doi: 10.1207/s15327957pspr0203_4
Kiers, H., Hoekstra, R., Tendeiro, J., & Van Ravenzwaaij, D. (2019). Unconf - Implications of teaching Bayesian statistics to undergraduate psychology students.
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahník, Š., Bernstein, M. J., … Nosek, B. A. (2014). Investigating Variation in Replicability. Social Psychology, 45(3), 142–152. doi: 10.1027/1864-9335/a000178
Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Alper, S., … Nosek, B. A. (2018). Many Labs 2: Investigating Variation in Replicability Across Samples and Settings. Advances in Methods and Practices in Psychological Science, 1(4), 443–490. doi: 10.1177/2515245918810225
Maes, E., Boddez, Y., Alfei, J. M., Krypotos, A.-M., D’Hooge, R., De Houwer, J., & Beckers, T. (2016). The elusive nature of the blocking effect: 15 failures to replicate. Journal of Experimental Psychology. General, 145(9), e49–71. doi: 10.1037/xge0000200
Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications in Psychology Research: How Often Do They Really Occur? Perspectives on Psychological Science, 7(6), 537–542. doi: 10.1177/1745691612460688
Martinson, B. C., Anderson, M. S., & de Vries, R. (2005). Scientists behaving badly. Nature, 435(7043), 737. doi: 10.1038/435737a
Meehl, P. E. (1967). Theory-Testing in Psychology and Physics: A Methodological Paradox. Philosophy of Science, 34(2), 103–115.
Mobley, A., Linder, S. K., Braeuer, R., Ellis, L. M., & Zwelling, L. (2013). A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic. PLOS ONE, 8(5), e63221. doi: 10.1371/journal.pone.0063221
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C., Percie du Sert, N., … Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1), 0021. doi: 10.1038/s41562-016-0021
Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45(3), 137–141. doi: 10.1027/1864-9335/a000192
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific Utopia: II. Restructuring Incentives and Practices to Promote Truth Over Publishability. Perspectives on Psychological Science, 7(6), 615–631. doi: 10.1177/1745691612459058
Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), 1205–1226. doi: 10.3758/s13428-015-0664-2
Oakes, M. W. (1986). Statistical inference: A commentary for the social and behavioural sciences. Chichester: John Wiley & Sons.
Oostenbroek, J., Suddendorf, T., Nielsen, M., Redshaw, J., Kennedy-Costantini, S., Davis, J., … Slaughter, V. (2016). Comprehensive Longitudinal Study Challenges the Existence of Neonatal Imitation in Humans. Current Biology, 26(10), 1334–1338. doi: 10.1016/j.cub.2016.03.047
OSC. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. doi: 10.1126/science.aac4716
Ranehill, E., Dreber, A., Johannesson, M., Leiberg, S., Sul, S., & Weber, R. A. (2015). Assessing the Robustness of Power Posing: No Effect on Hormones and Risk Tolerance in a Large Sample of Men and Women. Psychological Science, 26(5), 653–656. doi: 10.1177/0956797614553946
Reicher, S., & Haslam, S. A. (2006). Rethinking the psychology of tyranny: The BBC prison study. British Journal of Social Psychology, 45(1), 1–40. doi: 10.1348/014466605X48998
Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the Future: Three Unsuccessful Attempts to Replicate Bem’s “Retroactive Facilitation of Recall” Effect. PLoS ONE, 7(3). doi: 10.1371/journal.pone.0033423
Sarafoglou, A., Hoogeveen, S., Matzke, D., & Wagenmakers, E.-J. (2019). Teaching Good Research Practices: Protocol of a Research Master Course. Psychology Learning & Teaching, 1475725719858807. doi: 10.1177/1475725719858807
Schimmack, U. (2015). Questionable Research Practices: Definition, Detect, and Recommendations for Better Practices. https://replicationindex.com/2015/01/24/questionable-research-practices-definition-detect-and-recommendations-for-better-practices/.
Schönbrodt, F. (2015a). Questionable Research Practices. https://osf.io/bh7zv/.
Schönbrodt, F. (2015b). Red flags: How to detect publication bias and p-hacking. https://osf.io/cz7ht/.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. doi: 10.1177/0956797611417632
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547. doi: 10.1037/a0033242
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). P-Curve and Effect Size: Correcting for Publication Bias Using Only Significant Results. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 9(6), 666–681. doi: 10.1177/1745691614553988
Spreckelsen, T. F. (2018). Editorial: Changes in the field: Banning p-values (or not), transparency, and the opportunities of a renewed discussion on rigorous (quantitative) research. Child and Adolescent Mental Health, 23(2), 61–62. doi: 10.1111/camh.12277
Steele, K. M., Bass, K. E., & Crook, M. D. (1999). The Mystery of the Mozart Effect: Failure to Replicate. Psychological Science, 10(4), 366–369. doi: 10.1111/1467-9280.00169
Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37(1), 1–2. doi: 10.1080/01973533.2015.1012991
Vadillo, M. A., Gold, N., & Osman, M. (2018). Searching for the bottom of the ego well: Failure to uncover ego depletion in Many Labs 3. Royal Society Open Science, 5(8), 180390. doi: 10.1098/rsos.180390
van der Zee, T., Anaya, J., & Brown, N. J. L. (2017). Statistical heartburn: An attempt to digest four pizza publications from the Cornell Food and Brand Lab. BMC Nutrition, 3(1), 54. doi: 10.1186/s40795-017-0167-x
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., … Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928. doi: 10.1177/1745691616674458
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70(2), 129–133. doi: 10.1080/00031305.2016.1154108
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a World Beyond “p \(<\) 0.05”. The American Statistician, 73(sup1), 1–19. doi: 10.1080/00031305.2019.1583913
Watts, T. W., Duncan, G. J., & Quan, H. (2018). Revisiting the Marshmallow Test: A Conceptual Replication Investigating Links Between Early Delay of Gratification and Later Outcomes. Psychological Science, 29(7), 1159–1177. doi: 10.1177/0956797618761661
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking. Frontiers in Psychology, 7. doi: 10.3389/fpsyg.2016.01832